Skip to content

cuda: reset cuda context after reading memory size#23935

Open
0cc4m wants to merge 5 commits into
masterfrom
0cc4m/cuda-get-memory-device-reset
Open

cuda: reset cuda context after reading memory size#23935
0cc4m wants to merge 5 commits into
masterfrom
0cc4m/cuda-get-memory-device-reset

Conversation

@0cc4m
Copy link
Copy Markdown
Contributor

@0cc4m 0cc4m commented May 31, 2026

Overview

Alternative to #23604, to allow reading CUDA memory in the router process in #21231 without allocating permanent memory through an initialized CUDA context. Instead of using NVML, this checks before running cudaMemGetInfo whether the context is already initialized. If not, it releases the context after the call.

I tried ref-counting as well as suggested in #23604 (comment), but that is harder to get right and introduces more edge cases.

Requirements

@0cc4m 0cc4m requested a review from a team as a code owner May 31, 2026 06:37
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 31, 2026
@ORippler
Copy link
Copy Markdown
Collaborator

ORippler commented Jun 2, 2026

Alternative to #23604, to allow reading CUDA memory in the router process in #21231 without allocating permanent memory through an initialized CUDA context. Instead of using NVML, this checks before running cudaMemGetInfo whether the context is already initialized. If not, it releases the context after the call.

How often will the router process query the available memory? If it's only once at the beginning, I'd suggest to do a pattern like

ggml_backend_init -> ggml_backend_device_i.get_memory -> ggml_backend_i.free.

Intuitively, I'd have thought a backend has to be initialized before we can ask it about its available memory. Consequentially, we would move the release of the cuda context into ggml_backend_cuda_free

@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Jun 2, 2026

The behaviour of initialisation just to read memory state seems to be unique to CUDA, so I would prefer to handle it inside of the CUDA backend, not outside.

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
@JohannesGaessler
Copy link
Copy Markdown
Contributor

@ORippler as of right now fetching memory is part of the ggml backend device API (== CUDA device), not the ggml backend API (== CUDA stream). So the lifetime of the CUDA device context cannot be simply tied to the lifetime of a ggml backend unless we move that function. And I would not be in favor of this since the memory in my opinion belongs to the device.

@ORippler
Copy link
Copy Markdown
Collaborator

ORippler commented Jun 3, 2026

And I would not be in favor of this since the memory in my opinion belongs to the device.

Still unintuitive to me: what good is a device if I don't have the constructs/context in place to dispatch work to it. But maybe I'm too biased by CUDA on this one 🤷‍♂️

@0cc4m 0cc4m force-pushed the 0cc4m/cuda-get-memory-device-reset branch from a182b35 to 61e659a Compare June 4, 2026 11:44
@0cc4m 0cc4m requested a review from IMbackK as a code owner June 4, 2026 11:44
@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Jun 4, 2026

I forgot that hip and musa were also initially included here, I don't think that is required, so I'll remove it.

Edit: On second thought, that would require wrapping all counter calls in preprocessor checks. Not sure whether that would be better here.

@0cc4m 0cc4m removed the request for review from IMbackK June 4, 2026 11:54
@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Jun 4, 2026

I excluded hip and musa, sorry about the noise. @JohannesGaessler Let me know if this looks better.

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
Copy link
Copy Markdown
Contributor

@JohannesGaessler JohannesGaessler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@JohannesGaessler
Copy link
Copy Markdown
Contributor

I forgot: Ruben and me had an internal discussion about the implementation. The changes in this PR are still unsafe in combination with -sm row but that is now obsolete and I'll make a PR to remove it in the CUDA backend and simplify the code for cuBLAS.

@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Jun 6, 2026

@ggml-org/ggml-cuda Can I get another review?

@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Jun 6, 2026

I noticed the backend_free function was still moved because it needed access to the device struct, but that was no longer necessary because the struct was moved as well. I changed it back, gonna wait for CI again now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants